Generating Music from Raw MIDI

Aug 2023 ~ Network Institute VU

Length: 11 months

Programming language: Python (NumPy, Random, Math, PyYAML, PyTorch, PyTorch Lightning, W&B, TensorBoard, Python Fire)

Data: Raw MIDI representations composed of the following features: type (7 event types), note (128 notes), velocity (128 velocity levels), channel (16 MIDI channels), instrument (128 MIDI instruments), and tick (note on and off).
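As an illustration of the feature ranges listed above, one event could be held in a small validated record. This is only a sketch: the class name MidiEvent, the FEATURE_SIZES table, and the treatment of tick as a non-negative integer are assumptions made for the example, not the project's actual data format.

```python
from dataclasses import dataclass

# Vocabulary size per categorical feature, taken from the description above.
FEATURE_SIZES = {
    "type": 7,
    "note": 128,
    "velocity": 128,
    "channel": 16,
    "instrument": 128,
}


@dataclass(frozen=True)
class MidiEvent:
    """Hypothetical container for one raw MIDI event."""

    type: int
    note: int
    velocity: int
    channel: int
    instrument: int
    tick: int  # assumed here to be a non-negative integer time value

    def __post_init__(self):
        # Reject values that fall outside the feature's vocabulary.
        for name, size in FEATURE_SIZES.items():
            value = getattr(self, name)
            if not 0 <= value < size:
                raise ValueError(f"{name}={value} is outside [0, {size})")
        if self.tick < 0:
            raise ValueError(f"tick={self.tick} must be non-negative")


event = MidiEvent(type=0, note=60, velocity=100, channel=0, instrument=0, tick=480)
```

Validating at construction time keeps out-of-range indices from reaching an embedding lookup later on.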

Problem description:
Develop and train GPT-like transformers on an HPC cluster to generate music from raw MIDI.

Approach & Results:
Starting from the decoder-only skeleton architecture displayed below, several structural adjustments were implemented one at a time. Each variant was trained on HPC, and the resulting losses were compared in W&B. The experiments conducted include weight scaling, T-Fixup initialization, Stochastic Weight Averaging (SWA), ScaleNorm, and FixNorm. Additionally, the training code was extended to support multi-node training and resuming from checkpoints. Lastly, the embedding dimension, batch throughput, learning rate, and dropout rate were tuned, and installation and usage instructions were drafted.

Default Architecture
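Of the experiments listed above, the scale-normalization layer is compact enough to sketch. The version below follows the usual formulation, in which a vector is divided by its L2 norm and multiplied by a single learned scalar g initialized to the square root of the model dimension. The class name, the eps value, and the tensor sizes are assumptions for illustration; the project's own implementation may differ.

```python
import math

import torch
from torch import nn


class ScaleNorm(nn.Module):
    """L2-normalize the last dimension and rescale by one learned scalar g."""

    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        # A single scalar parameter, initialized to sqrt(dim).
        self.g = nn.Parameter(torch.tensor(math.sqrt(dim)))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Clamp the norm so that all-zero vectors do not divide by zero.
        norm = x.norm(dim=-1, keepdim=True).clamp(min=self.eps)
        return self.g * x / norm


torch.manual_seed(0)
x = torch.randn(2, 8, 64)  # (batch, sequence, embedding), sizes chosen arbitrarily
y = ScaleNorm(64)(x)
```

Compared with layer normalization, this layer has one learnable parameter instead of two per feature, which is the main appeal of the technique.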

The performance of the transformers was measured in bits per event, that is, the average number of bits a model needs to encode one MIDI event (the lower, the better). Among the enhancements incorporated, mixed precision produced a remarkable result. The image below shows the validation losses of two models, one of which uses mixed precision. Both transformers were trained for 120 hours, yet the one with mixed precision ran three times faster without affecting the loss.

Validation Loss Mixed Precision
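For readers unfamiliar with the metric, bits per event can be derived from a cross-entropy loss. The sketch below assumes the loss is a mean negative log-likelihood per event expressed in nats (the natural-log convention PyTorch's cross-entropy uses); the function names are hypothetical, and how the project aggregates the per-feature losses is not stated here.

```python
import math


def nats_to_bits(nll_nats: float) -> float:
    """Convert a mean negative log-likelihood from nats to bits."""
    return nll_nats / math.log(2)


def mean_bits(probabilities: list[float]) -> float:
    """Average number of bits spent on the probabilities assigned to the true events."""
    return -sum(math.log2(p) for p in probabilities) / len(probabilities)


# A model that guesses uniformly among 128 notes spends log2(128) bits per note.
uniform_note_bits = mean_bits([1 / 128] * 4)
```

A lower value means the model assigns higher probability to the events that actually occur in the validation data.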

The next successful experiment switched the embedding initialization to the PyTorch default used by nn.Embedding. This lowered the validation loss by 2%, as displayed in the next chart.
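By default, nn.Embedding draws its weights from a standard normal distribution. The sketch below contrasts that default with a hand-scaled alternative. The embedding dimension and the scaled variant are assumptions chosen purely for contrast; the initialization the project used before the switch is not specified here.

```python
import torch
from torch import nn

torch.manual_seed(0)

NUM_NOTES = 128   # note vocabulary size from the data description
EMBED_DIM = 512   # hypothetical embedding dimension

# PyTorch default: weights sampled from N(0, 1).
default_embedding = nn.Embedding(NUM_NOTES, EMBED_DIM)
default_std = default_embedding.weight.std().item()

# Hypothetical alternative for comparison: N(0, 1 / EMBED_DIM).
scaled_embedding = nn.Embedding(NUM_NOTES, EMBED_DIM)
nn.init.normal_(scaled_embedding.weight, mean=0.0, std=EMBED_DIM ** -0.5)
scaled_std = scaled_embedding.weight.std().item()
```

The two tables differ only in the spread of their initial weights, which changes the scale of the inputs the first transformer block sees at the start of training.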

Validation Loss Embedding Initialization

To reduce overfitting, various data augmentation techniques were applied, resulting in another significant decrease in the validation loss, as the figure below suggests.

Validation Loss Data Augmentation
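The specific augmentation techniques are not listed above. As an illustration only, two augmentations commonly applied to MIDI data are pitch transposition and velocity scaling; the sketch below shows both. All function names and parameter ranges are assumptions, not the project's actual pipeline.

```python
import random


def transpose(notes: list[int], semitones: int) -> list[int]:
    """Shift every pitch; return the notes unchanged if any would leave 0..127."""
    shifted = [n + semitones for n in notes]
    if any(not 0 <= n <= 127 for n in shifted):
        return list(notes)
    return shifted


def scale_velocity(velocities: list[int], factor: float) -> list[int]:
    """Scale loudness, clamp to 1..127, and keep velocity 0 untouched."""
    return [0 if v == 0 else max(1, min(127, round(v * factor))) for v in velocities]


def augment(notes: list[int], velocities: list[int], rng: random.Random):
    """Apply one random transposition and one random velocity scaling."""
    semitones = rng.randint(-6, 6)      # hypothetical range
    factor = rng.uniform(0.8, 1.2)      # hypothetical range
    return transpose(notes, semitones), scale_velocity(velocities, factor)


rng = random.Random(0)
new_notes, new_velocities = augment([60, 64, 67], [0, 64, 120], rng)
```

Velocity 0 is left alone because a note-on with zero velocity is conventionally treated as a note-off, so rescaling it would change the meaning of the event.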

Finally, the best model was used to generate the following samples. Each sample starts with a seed extracted from an existing song, followed by a whistle that marks the beginning of the music generated by the model. The samples show that the transformer can reliably produce chords and pick up the timing from the seed.

